Two experimental vocoder systems are described which exploit the
frame-fill  concept  described  by  McLarnon  to  achieve  data  rates  in  the 
range  of  800  to  1200  bps.  One  is  based  on  a  well-known  2400  bps  channel 
vocoder  design,  the  second  is  based  on  a  form  of  the  Lincoln  2400  bps 
linear  predictive  coder  (LPC-10)  algorithm.  Both  systems  were  found  to 
perform  well  at  the  1200  bps  rate  representing  a  2:1  savings  in  transmission 
bandwidth  at  very  little  additional  algorithm  complexity.  At  800  bps 
both  systems  were  judged  usable  but  not  wholly  satisfactory.  Performance 
of the channel vocoder was considered marginally better than LPC at 800
bps.

I. INTRODUCTION

Planners  of  advanced  future  communications  systems  continue  to 
challenge  voice  digitization  system  designers  with  demands  for  further 
reductions  in  required  transmission  bandwidth.  Given  the  practical 
balance  which  must  be  struck  between  algorithm  performance  and  imple- 
mentational  complexity,  2400  bits/sec  represents  a  reasonable  lower 
limit  on  the  capability  of  speech  bandwidth  compression  technology 
today.  Even  at  this  rate,  performance  is  still  a  major  issue  within  the 
military  operational  community  where  it  is  frequently  necessary  to 
accommodate  acoustically  noisy  and  heavily  jammed  environments. 

Several  approaches  have  been  or  are  presently  being  pursued  in  an 
effort  to  develop  voice  digitization  algorithms  supporting  data  rates 
well  below  2400  bps.  Based  on  formant  tracking,  vector  quantization, 
pattern  matching,  or  diphone  analysis  concepts,  these  systems  are  still 
highly  embryonic,  experimental,  and  in  some  cases  quite  computationally 
complex. 

This report describes a near-term, low-complexity, low-risk approach
to achieving data rates in the 800 to 1200 bps range based on standard
2400 bps analysis-synthesis system technology. Using conventional
filter bank-based or linear prediction (LPC) type systems as fundamental
backbones, it is shown that surprisingly good performance is possible in
this bit rate range with essentially negligible increases in
implementational complexity.
II. THE FRAME-FILL CONCEPT


The frame-fill concept described by McLarnon [1] represents a
conceptually simple and straightforward approach to reducing the
transmission bandwidth requirement of a frame-oriented digital voice
system. The basic idea is to transmit from analyzer to synthesizer only
every Mth data frame, thereby achieving an approximate M:1 reduction in
rate. The savings is only approximate in that some control information
must also be supplied to the receiver instructing the synthesizer how to
reconstruct (or "fill in") missing information given some pre-agreed
fixed set of options.

This  idea  is  of  great  interest  when  applied  to  2400  bps  vocoder 
systems  which  represent  a  practical  lower  bound  on  the  data  rate  capability 
of  today's  voice  digitization  algorithm  technology.  By  choosing  M  =  2, 
a  1200  bps  data  rate  would  be  achieved.  If  M  =  3,  an  800  bps  system 
would  appear  feasible.  When  the  mechanics  of  human  speech  production 
and  perception  are  taken  into  consideration,  M  =  2  is  the  most  reasonable 
choice  if  starting  with  a  2400  bps  system.  This  stems  from  the  fact 
that  narrowband  vocoders  typically  operate  at  fundamental  frame  production 
rates  of  about  50  Hz.  This  is  close  to  a  practical  minimum  if  essential 
phonemic  transitions  in  the  speech  are  to  be  reasonably  preserved.  If 
more  than  50%  of  the  vocoder  analysis  frames  are  omitted,  it  is  effectively 
impossible  to  avoid  unacceptable  losses  in  this  vital  transitional 
information  which  bears  so  significantly  on  intelligibility. 

This  report  will  focus  its  attention  on  nominal  2400  bps  systems  with 
M = 2; i.e., every other frame produced by the vocoder analyzer will be
omitted  from  the  transmitted  data  stream.  Given  this  constraint,  the 
following  general  set  of  rules  will  be  applied  in  determining  the  best 
"fill  in"  option  at  the  vocoder  synthesizer: 

(1) Compare the frame of data to be omitted with the frames
immediately preceding and succeeding in the temporal sense.

(2)  In  accordance  with  some  "reasonable"  distance  metric,  decide 
which  neighbor  matches  the  frame  to  be  omitted  most  closely. 

(3)  Also  consider  as  a  match  candidate  some  weighted  combination 
of  the  information  contained  in  the  two  neighboring  frames. 

(4)  Select  the  option  (3  choices)  representing  the  best  match  and 
append  its  I.D.  code  (2  extra  bits)  to  the  frame  which  is  to  be 
transmitted. 
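
The four rules above can be illustrated with a minimal Python sketch. It assumes each frame is a plain list of vocal tract parameters and uses a sum of magnitude differences as the "reasonable" distance metric. The option code values and all function names are hypothetical; the text fixes only the number of choices (3) and the size of the I.D. code (2 bits).

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]

# Hypothetical 2-bit option codes; the text fixes only the number of choices.
FILL_PREVIOUS, FILL_NEXT, FILL_AVERAGE = 0, 1, 2


def sum_abs_diff(candidate: Frame, reference: Frame) -> float:
    """One possible 'reasonable' distance: sum of magnitude differences."""
    return sum(abs(c - r) for c, r in zip(candidate, reference))


def fill_candidates(prev: Frame, nxt: Frame) -> List[List[float]]:
    """The three reconstructions available from the two transmitted neighbors."""
    average = [(p + n) / 2.0 for p, n in zip(prev, nxt)]
    return [list(prev), list(nxt), average]  # indexed by option code


def choose_fill_option(
    prev: Frame,
    omitted: Frame,
    nxt: Frame,
    metric: Callable[[Frame, Frame], float] = sum_abs_diff,
) -> int:
    """Analyzer side: return the code of the candidate closest to the omitted frame."""
    candidates = fill_candidates(prev, nxt)
    return min(range(len(candidates)),
               key=lambda code: metric(candidates[code], omitted))


def reconstruct(prev: Frame, nxt: Frame, code: int) -> List[float]:
    """Synthesizer side: rebuild the omitted frame from its neighbors and the code."""
    return fill_candidates(prev, nxt)[code]
```

The analyzer calls `choose_fill_option` with the frame it is about to drop; the synthesizer needs only the two neighboring frames and the transmitted code to call `reconstruct`.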

In  practice  these  rules  apply  most  naturally  to  the  vocal  tract 
parametric information. In a channel vocoder this corresponds to the
spectral samples; in an LPC vocoder, to the K-parameter set. Also, the 3rd
fill-in option is usually constrained to be a simple average of the
neighboring  frame  data.  The  excitation  information,  however,  is  handled 
separately. 

Both  systems  discussed  here  treat  the  pitch  and  voicing  parameters 
in  the  same,  empirically  determined  way.  It  is  known  that  the  voicing 
information plays a critical role in determining the quality and
intelligibility of the synthetic speech. Since it comprises only a single
bit per frame, the penalty for transmitting it every frame time is minimal.


Given  that  all  the  voicing  information  is  available  at  the  receiver, 
intelligent  decisions  can  be  made  on  reconstructing  the  omitted  pitch 
parameter  (if  needed)  from  information  in  the  neighboring  frames.  Of 
course,  if  the  omitted  frame  happens  to  be  unvoiced,  then  a  pitch  parameter 
may  not  be  needed. 

The  pitch/voicing  reconstitution  strategy  is  summarized  in  Table  I 
where  the  excitation  parameter  is  developed  based  on  the  8  possible 
combinations  of  voicing  bits  that  might  be  encountered.  Frame  "N"  is  to 
be  reconstituted,  and  only  its  voicing  bit  is  available  to  the  receiver. 
Implicit  in  this  strategy  is  some  editing  of  the  voicing  decision  itself 
aimed  at  rejecting  improbable  combinations.  The  pitch  fill-in  approach 
suggested  appears  from  our  experience  to  offer  the  most  favorable 
perceptual  impact. 

Given  this  common  approach  to  dealing  with  the  excitation  information, 
the  next  two  sections  will  focus  on  methods  for  reconstruction  of  the 
vocal  tract  data  unique  to  channel  and  LPC  types  of  2400  bps  backbone 
vocoder. 
III. FILTER BANK-BASED SYSTEM

The  channel  vocoder  system  used  as  the  basis  for  this  series  of 
experiments  was  modelled  after  the  UK  JSRU  vocoder  [2]  sometimes  referred 
to  as  the  "Belgard"  algorithm.  This  vocoder  transmits  19  spectral 
samples  per  20  msec  spanning  the  frequency  range  of  200  to  4000  Hz  and 
features  relatively  simple  analyzer  and  synthesizer  bandpass  filter 
designs. 

In  this  framework  the  vocal  tract  parametric  information  assumes 
the form of 19 logarithmically encoded spectral samples. McLarnon
suggested that the 3rd fill-in choice be constructed by averaging the
log spectral data on a channel-by-channel basis. He further suggested a
very  simple  distance  metric  of  the  form 

$$ e_1 = \sum_{k=1}^{19} |S_c(k) - S_r(k)| \tag{1} $$

where  Sc(k)  and  Sr(k)  are  the  kth  log  spectral  samples  of  the  candidate 
and  reference  frames  respectively.  In  other  words,  this  is  simply  the 
sum  over  all  the  vocoder  channels  of  magnitudes  of  the  differences  in 
the  spectral  samples.  This  simple  metric  offers  the  obvious  advantage 
of  being  very  easily  computed. 

In  the  course  of  experimentation  at  least  three  other  metrics  of 
similar  complexity  were  tried  as  summarized  below. 

$$ e_2 = \sum_{k=1}^{19} |S_c(k) - S_r(k)| \cdot S_r(k) \tag{2} $$

$$ e_3 = \max_k |S_c(k) - S_r(k)| \tag{3} $$

$$ e_4 = \sum_{k=1}^{18} |\{S_c(k+1) - S_c(k)\} - \{S_r(k+1) - S_r(k)\}| \tag{4} $$

These  were  developed  in  the  hope  of  finding  one  offering  a  more  favorable 
overall  perceptual  impact.  Metric  (2)  represents  a  weighted  version  of 
(1)  where  a  given  spectral  difference  influences  the  overall  metric  in  a 
manner  consistent  with  the  spectral  amplitude  at  that  point  in  frequency. 
Metric  (3)  seeks  to  minimize  the  maximum  single  point  error  over  the  3 
fill-in  alternatives.  Metric  (4),  suggested  by  Klatt  [3],  is  designed 
to  emphasize  differences  in  spectral  slope. 
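
As an illustration, the four metrics can be written as short Python functions. This is a sketch under the assumption that a frame is a list of log spectral samples, one per channel; the function names are hypothetical and do not come from the original system.

```python
from typing import Sequence

Spectrum = Sequence[float]  # log spectral samples, one per vocoder channel


def metric_1(cand: Spectrum, ref: Spectrum) -> float:
    """Metric (1): sum over channels of magnitude differences."""
    return sum(abs(c - r) for c, r in zip(cand, ref))


def metric_2(cand: Spectrum, ref: Spectrum) -> float:
    """Metric (2): metric (1) with each term weighted by the reference sample."""
    return sum(abs(c - r) * r for c, r in zip(cand, ref))


def metric_3(cand: Spectrum, ref: Spectrum) -> float:
    """Metric (3): maximum single-point error."""
    return max(abs(c - r) for c, r in zip(cand, ref))


def metric_4(cand: Spectrum, ref: Spectrum) -> float:
    """Metric (4): sum of magnitude differences in channel-to-channel slope."""
    return sum(
        abs((cand[k + 1] - cand[k]) - (ref[k + 1] - ref[k]))
        for k in range(len(ref) - 1)
    )
```

Because metric (4) is built only from differences between adjacent channels, adding the same constant to every sample of a spectrum leaves it unchanged, whereas metrics (1) through (3) all respond to such a level shift.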

Experience  indicated  little  performance  difference  among  (1),  (2), 
and  (3)  with  a  slight  preference,  based  on  informal  listening,  emerging 
for  (2).  Metric  (4),  however,  was  found  to  be  decidedly  inferior.  A 
possible  reason  is  failure  to  take  into  account  total  spectral  energy 
where  this  information  is  implicit  to  some  extent  in  each  of  the  other 
metrics.  It  is  therefore  possible  that  two  spectra  with  wildly  differing 
energy  content  could  be  equated  because  of  similarities  in  formant 
structure.  Metric  (4)  by  itself  was  therefore  not  considered  satisfactory 
for  present  purposes. 

Both  1200  and  800  bps  versions  of  the  channel  vocoder  system  were 
developed as summarized in Table II. The 1200 bps system is based on
the  2400  bps  system  as  shown.  Here  48  bits  are  transmitted  per  20  msec 
and  2  are  normally  uncommitted.  A  2-bit-per-channel  DPCM  type  of  coding 
scheme is used to represent the channel weights. The lowest frequency
channel  (240  Hz)  is  used  as  a  starting  point  and  is  coded  in  a  3-bit  log 
PCM  format.  The  1200  bps  variant  transmits  48  bits  each  40  msec.  The 
two  formerly  uncommitted  bits  are  used  to  convey  the  spectral  fill-in 
option  dictated  by  the  analyzer.  The  highest  frequency  channel  is 
represented  as  a  1-bit  DPCM  datum  thereby  freeing  a  bit  to  represent  the 
voicing  state  of  the  omitted  frame.  The  transmitted  frame,  therefore, 
contains  6  pitch  bits  (log  coded),  2  voicing  bits,  2  fill-in  control 
bits,  and  38  spectrum  bits. 

The  800  bps  version  is  also  a  2:1  reduction  system  and  therefore 
starts  with  a  1600  bps  backbone.  If  all  of  the  spectral  information  is 
to  be  included,  the  minimum  number  of  spectral  bits  possible  (relying  on 
1-bit  DPCM  coding)  is  3  +  18  =  21.  Given  the  remaining  information  that 
must  be  included  (6  +  2  +  2  =  10),  the  net  minimum  for  a  fill-in  system 
would  be  31  bits/frame.  At  20  msec/frame,  this  results  in  a  net  data 
rate  figure  of  1550  bps. 
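
The bit budget in this paragraph can be checked with a few lines of Python. The sketch below only restates the arithmetic given in the text; the function and parameter names are hypothetical.

```python
def spectral_bits(channels: int = 19, first_channel_bits: int = 3,
                  dpcm_bits: int = 1) -> int:
    """Bits for the spectrum: one PCM-coded start channel plus DPCM for the rest."""
    return first_channel_bits + (channels - 1) * dpcm_bits


def frame_bits(pitch_bits: int = 6, voicing_bits: int = 2, fill_bits: int = 2,
               **spectral_options: int) -> int:
    """Total bits in one transmitted frame of a fill-in system."""
    return spectral_bits(**spectral_options) + pitch_bits + voicing_bits + fill_bits


def bit_rate_bps(bits_per_frame: int, frame_ms: float) -> float:
    """Data rate in bits per second for a given frame size and frame period."""
    return bits_per_frame * 1000.0 / frame_ms


print(spectral_bits())                    # 3 + 18 = 21
print(frame_bits())                       # 21 + 10 = 31
print(bit_rate_bps(frame_bits(), 20.0))   # 1550.0
```

With `dpcm_bits=2` the same function gives 39 spectral bits, one more than the 38 spectrum bits of the 1200 bps frame, whose highest frequency channel is cut to 1 bit.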

The 1-bit DPCM coding allows step sizes normally fixed at ±6 dB,
a  reasonable  compromise  in  most  cases.  However,  the  1-bit  DPCM  coding 
does  a  significantly  poorer  job  of  representing  the  spectrum  than  2-bit 
DPCM.  An  effort  was  made  to  improve  the  quality  of  the  1600  bps  spectral 
representation  by  permitting  some  flexibility  in  the  choice  of  step  size 
for  the  1-bit  DPCM  coding.  A  system  was  developed  which  permitted  step 
size  choices  of  3,  4.5,  6,  and  7.5  dB.  The  step  size  choice  was  determined 
by  comparing  the  coded  spectrum  with  the  uncoded  reference  for  each  of 
the 4 options. A distance metric similar to the one used for frame
fill-in  purposes  was  used  as  the  determinant.  Once  the  best  fit  was 
determined,  then  the  normal  2:1  frame  rate  reduction  process  was  put 
into  effect.  Again  metric  (2)  was  found  to  be  the  most  useful.  Given 
four  choices  of  step  size,  2  more  control  bits  were  appended  to  the 
transmit  frame  for  a  total  of  33.  To  reduce  the  data  rate  to  exactly 
1600  bps,  the  frame  period  was  lengthened  slightly  to  20.625  msec. 
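
The step size selection described above might be sketched as follows. This is an illustrative reading, not the original implementation. It assumes the first channel value is available exactly (the 3-bit log PCM quantization is omitted), that each DPCM bit moves the running reconstruction up or down by one step, and that a plain sum of magnitude differences stands in for the distance metric. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

STEP_SIZES_DB = (3.0, 4.5, 6.0, 7.5)  # the four choices; the index fits in 2 bits


def dpcm1_encode(spectrum_db: Sequence[float],
                 step_db: float) -> Tuple[List[int], List[float]]:
    """1-bit DPCM across channels; returns the bits and the decoder's reconstruction."""
    recon = [spectrum_db[0]]  # assumption: start channel is sent separately
    bits = []
    for target in spectrum_db[1:]:
        bit = 1 if target >= recon[-1] else 0  # step up or step down
        bits.append(bit)
        recon.append(recon[-1] + (step_db if bit else -step_db))
    return bits, recon


def choose_step(spectrum_db: Sequence[float]) -> Tuple[int, List[int], List[float]]:
    """Try every step size and keep the one whose coded spectrum fits best."""
    best = None
    for index, step in enumerate(STEP_SIZES_DB):
        bits, recon = dpcm1_encode(spectrum_db, step)
        error = sum(abs(a - b) for a, b in zip(recon, spectrum_db))
        if best is None or error < best[0]:
            best = (error, index, bits, recon)
    _, index, bits, recon = best
    return index, bits, recon
```

The returned index is what the two extra control bits would carry. The frame-fill decision is then made on the coded spectra, as the text describes.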

In  considering  ways  to  further  improve  quality,  it  was  realized 
that  the  6  pitch  bits  were  being  effectively  wasted  during  unvoiced 
frames. It was decided to give these bits over to the spectral
representation during unvoiced frames resulting in a combined 1-bit/2-bit
DPCM spectral coding where channels 2 through 7 received the 2-bit accuracy.
Although  in  benign  environments  spectral  detail  is  not  as  critical  in 
consonantal  sounds  as  it  is  during  vowels,  there  is  some  evidence  that 
it  is  of  importance  in  noisy  backgrounds.  The  resulting  800  bps  system 
is  summarized  at  the  bottom  of  Table  II. 

The  1200  bps  system  was  found  to  perform  quite  well  based  on  informal 
listening  tests  using  high  quality  (acoustically  quiet  background, 
dynamic  microphone)  input  material.  In  many  instances  the  1200  bps 
output  could  barely  be  distinguished  from  the  2400  bps  parent.  The  800 
bps  system  was  observed  to  do  surprisingly  well  for  its  rate  and  complexity 
but  was  generally  judged  markedly  inferior  to  the  1200  bps  version. 

Under  relatively  benign  conditions  it  is  probably  usable,  especially  in 
the  hands  of  properly  trained  personnel.  Under  the  degraded  conditions 
typical  of  many  military  environments  it  would  probably  not  be  acceptable. 
IV. LINEAR PREDICTION-BASED SYSTEM


Low  rate  systems  based  on  Linear  Predictive  Coder  (LPC)  backbones 
are  of  particular  interest  within  the  DoD  community  since  this  class  of 
algorithm  has  been  selected  as  the  standard  for  2400  bps  applications, 
and  all  interoperability  criteria  are  based  on  a  particular  LPC  formulation. 
The  experimental  system  described  here  is  based  on  the  Lincoln  10th 
order  autocorrelation  LPC  [4]  modified  to  be  consistent  with  the  essential 
elements  of  the  DoD  interoperability  specification  [5,6].  Modifications 
include 22.5 msec non-overlapped framing, digital audio conditioning, a
4th  order  LPC  fit  during  unvoiced  frames,  and  implementations  of  NSA- 
specified  coding  tables  for  pitch/voicing,  energy,  and  K-parameters. 

Although  the  philosophical  approach  taken  is  identical  to  that 
employed  in  the  channel  vocoder,  the  parametric  information  developed  in 
an LPC system is unique and must be given special consideration. For
example, the vocal tract is represented by a 10th order, all-pole
digital filter. The filter is characterized for transmission purposes
in terms of 10 K-parameters which can be applied directly at the
synthesizer if the synthesis filter is implemented in a lattice ("acoustic
tube")  form.  Alternatively,  the  filter  could  be  implemented  in  a  direct 
form  requiring  the  so-called  direct  form  ("a")  parameters. 

There  are  many  possible  distance  metrics  that  have  been  discussed 
in  the  literature  which  could  be  developed  from  these  or  other  parametric 
representations.  In  a  study  by  Barnwell  [7]  metrics  of  the  form  shown 
below  were  evaluated  in  terms  of  perceptual  impact  on  human  subjects. 
Possible choices of parameter sets included a-parameters, K-parameters,
area ratios, and log area ratios. It was found that the choice of
K-parameters with N = 1 gave good correlation with perceptual feedback.
That is, the parameter set that minimizes e when based on the
sum-of-magnitudes differences in K-parameter space also sounds the
closest to the reference set. This fortunate result leads to a metric for
LPC systems which has identical form to that of the channel vocoder:

$$ e = \sum_{k=1}^{10} |K_c(k) - K_r(k)| $$

In the spirit of the channel vocoder, the fill-in options permitted
were: fill forward, fill backward, or fill with an average of the
K-parameter sets.

However, the LPC formulation includes a separate parameter which
takes  into  account  total  spectral  energy  content.  In  the  experimental 
system  the  energy  parameter  was  treated  separately  and  was  reconstructed 
independently  of  the  spectral  data  as  shown  below: 


$$ e_E = |\log E_c - \log E_r| $$


where  the  fill-in  options  were  constrained  to  be  fill  forward,  fill 
backward,  or  fill  with  the  average  of  the  log  energies. 




All  of  these  metrics  feature  the  desirable  property  of  extreme 
computational  simplicity,  and  there  is  no  need  to  develop  intermediate 
functional  representations  (e.g.,  Itakura-Saito  metric).  Also  implicit 
is  the  requirement  for  2  sets  of  fill-in  control  bits  since  energy  and 
spectrum  are  treated  separately. 

Both  1200  and  800  bps  systems  were  developed  and  evaluated  as 
summarized in Table III. Notice that pitch and voicing in the 1200 bps
version are handled exactly as they were in the channel vocoder. Four
fill-in control bits are supplied to decouple energy from spectrum. The
parameter coding strategy shown in the table represents a necessary
departure from the DoD standard [5] due to the extra 5 bits required for
fill-in and voicing control. K3 has been reduced from 5 to 4 bits, K6
from 4 to 3, K7 from 4 to 2, and K8 from 3 to 2. These choices evolved
empirically through informal listening tests conducted with trained
personnel.

The  800  bps  system  was  much  more  difficult  to  implement  given  the 
lack of a truly efficient parameter coding scheme akin to the 1-bit DPCM
technique applied in the channel vocoder. To create the necessary 1600
bps backbone, 18 out of 54 bits per frame must be discarded. To accomplish
this K5 through K9 were reduced to 1 bit each, energy and K0 through K2
were reduced to 4 bits each, and K3 and K4 were dropped to 3 and 2 bits
respectively. This was judged to be the maximum that could be stripped
from the spectral representation if 10 K-parameters were to be retained.
To gain the remaining number of bits required, the explicit sync bit was
dropped, the present-frame voicing bit was packed into the 6-bit pitch
word (by sacrificing one pitch code), and the number of control bits was
reduced  from  4  to  3.  The  reduction  in  control  bits  was  accomplished  by 
noting  that  4  bits  were  being  used  to  represent  3x3=9  combinations 
of  spectral  and  energy  fill-in  choice.  Statistics  were  gathered  over  a 
large  body  of  speech  data  indicating  that  one  of  the  9  combinations 
occurred rather infrequently. This case, where encountered, was
automatically mapped into one of the remaining 8 choices. With only 8
permitted  combinations,  3  control  bits  suffice. 
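
The reduction from 9 combinations to a 3-bit code can be sketched as a small codebook. The text does not say which combination was rare or which choice it was mapped to, so `RARE_PAIR` and `SUBSTITUTE` below are purely hypothetical placeholders, as are the code assignments themselves.

```python
from itertools import product
from typing import List, Tuple

OPTIONS = ("forward", "backward", "average")
Pair = Tuple[str, str]  # (spectrum fill-in option, energy fill-in option)

ALL_PAIRS: List[Pair] = list(product(OPTIONS, repeat=2))  # 3 x 3 = 9 combinations

# Hypothetical: the source does not identify the infrequent combination
# or the combination it is replaced by.
RARE_PAIR: Pair = ("forward", "backward")
SUBSTITUTE: Pair = ("forward", "average")

# The 8 combinations that can actually be transmitted; list index = 3-bit code.
CODEBOOK: List[Pair] = [pair for pair in ALL_PAIRS if pair != RARE_PAIR]


def encode(spectrum_option: str, energy_option: str) -> int:
    """Map a (spectrum, energy) choice to a 3-bit code, folding in the rare case."""
    pair = (spectrum_option, energy_option)
    if pair == RARE_PAIR:
        pair = SUBSTITUTE
    return CODEBOOK.index(pair)


def decode(code: int) -> Pair:
    """Recover the (spectrum, energy) fill-in options at the synthesizer."""
    return CODEBOOK[code]
```

In a real system the rare pair would be identified from speech statistics, as the text describes, and its substitute chosen to minimize the perceptual cost.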

As  in  the  case  of  the  channel  vocoder,  informal  listening  tests 
indicated that the 1200 bps LPC performed nearly as well as the 2400 bps
version.  However,  the  800  bps  version  was  judged  to  be  inferior  to  its 
channel  vocoder  counterpart.  This  is  probably  due  to  the  fact  that  a 
crude-but-efficient  spectral  coding  scheme  analogous  to  1-bit  DPCM  is 
not  presently  known  for  LPC  parameters. 
V. FORMAL PERFORMANCE EVALUATIONS


Both  the  filter  bank  and  LPC-based  systems  were  subjected  to 
formal  evaluation  through  the  diagnostic  rhyme  test  (DRT).*  Each  system 
was  tested  at  rates  of  2400,  1200,  and  800  bps.  The  source  material  was 
comprised  of  3  male  talkers  in  an  acoustically  quiet  background  using  a 
high-quality  dynamic  microphone.  The  results  are  summarized  in  Table  IV. 

Based  on  these  results  alone,  it  appears  as  though  both  1200  and 
800  bps  data  rates  are  usable  in  relatively  benign  environments  although 
some  informal  in-house  communicability  tests  tended  to  refute  this 
conclusion  at  800  bps.  Note  that  the  LPC-based  systems  scored  slightly 
below  the  channel  vocoder  at  1200  bps  and  800  bps  which  is  in  agreement 
with  impressions  gained  through  informal  listening.  As  stated  previously, 
this  is  probably  due  to  the  lack  of  a  truly  efficient  parameter  coding 
scheme  for  LPC  equivalent  to  the  DPCM  spectral  coding  approach  used  in 
the  channel  vocoder. 


*DRT scoring services provided by RADC/EEV speech laboratory.
TABLE IV

SUMMARY OF 3-SPEAKER DRT RESULTS

RATE (BPS)    CHANNEL VOCODER    LPC VOCODER
2400          89.6               91.6
1200          87.6               85.1
800           84.0               82.0


VI.  RATE  COMPATIBILITY  AND  EMBEDDED  CODING 


The  reduced  rate  systems  described  evolve  as  simple  extensions  of 
their  respective  2400  bps  parents  and  are  parameter-compatible  with 
them.  Given  this  close  relationship,  it  is  reasonable  to  consider 
whether  or  not  the  frame-fill  approach  is  consistent  with  so-called 
embedded  coding  concepts  and  whether  high  and  low  rate  versions  of  the 
same  system  can  be  made  to  intercommunicate  in  some  reasonable  (though 
probably suboptimum) way. Given the freedom to design transmission
protocols and formats properly a priori, the answer would appear to be
positive as discussed below.

The  concept  of  embedded  coding  makes  the  most  sense  in  the  context 
of  an  intelligent  connectivity  medium  such  as  a  packetized  digital 
multifunction communications network [8]. The ARPANET and JTIDS are
examples  of  this  type  of  advanced  and  sophisticated  system.  The  basic 
idea  is  to  produce  a  high  rate  data  stream  at  the  transmission  source  in 
which  is  embedded  one  or  several  lower  data  rate  streams.  The  network, 
having  knowledge  of  which  data  is  essential  and  which  is  expendable,  can 
delete  the  less  critical  information  according  to  some  pre-agreed  set  of 
priorities  if  circumstances  warrant.  It  might  do  this  in  response  to 
fluctuations  in  available  channel  capacity  caused  by  heavy  traffic  or 
severe  jamming.  In  the  case  of  a  nominal  2400  bps  voice  digitizer,  the 
network  could  cut  the  data  rate  effectively  in  half  by  invoking  the 
frame-fill mode if a discretionary mechanism were available for it to do
so. 
The channel vocoder will be considered as an example of how this might
be accomplished. The transmitter could be modified to perform the
frame-fill control computations on successive pairs of frames at all
times and form a pair of packets as shown in Fig. 1. Packet P1 is the
high priority member of the pair and contains the essential information.
It is comprised of the usual pitch (P_n), voicing (V_n), and spectral
data (S_n). However, it is augmented to include the voicing bit from the
contiguous frame (V_{n+1}) and a 2-bit control field (C_n) indicating how
best to reconstruct the pitch and spectral data for the contiguous frame
(cf. Section III). The actual pitch and spectral information for that
frame along with miscellaneous unused bits are contained in the lower
priority packet, P0. The two packets together account for 96 bits which
are transmitted in a 40 msec epoch. Thus the transmitter produces a
constant 2400 bps data stream from which appropriately designated
information can be stripped at will, reducing the net data rate to 1200
bps (for this particular format convention).

It  remains  for  the  network  to  notify  the  receiver  when  the  data 
rate  has  been  so  modified,  which  is  an  easy  task  for  a  medium  of  this 
presumed type. The receiver will then invoke its frame-fill logic to
operate  on  the  data  actually  received  in  accordance  with  the  techniques 
described  in  Section  III.  If  the  data  rate  has  not  been  modified,  the 
receiver  will  ignore  the  frame-fill  control  data  and  operate  normally  at 
the  2400  bps  rate. 

It is also interesting to consider the possibility of rate compatibility
between 1200 and 2400 bps variants of the same generic vocoder type
given  unsophisticated  connectivity.  It  would  be  desirable,  for  example, 
for  a  1200  bps  source  to  be  able  to  communicate  with  both  1200  and  2400 
bps  receivers  without  any  specific  knowledge  of  which  type  might  actually 
be  at  the  other  end  of  the  link.  Conversely,  it  might  be  useful  to  have 
a  1200  bps  receiver  which  can  absorb  data  from  both  1200  and  2400  bps 
sources  without  any  special  control  interactions. 

Considering  again  the  channel  vocoder,  assume  that  the  transmitter 
whether  in  a  1200  or  2400  bps  mode  produces  frames  of  data  formatted  as 
shown  in  Fig.  1(a).  If  operating  at  2400  bps,  one  frame  will  be  transmitted 
each  20  msec;  at  1200  bps  the  transmission  epoch  will  be  40  msec.  Also 
assume  that  in  the  2400  bps  mode  the  bit  stream  is  arranged  to  interlace 
bits  on  a  frame  pair  basis  as  indicated  in  Fig.  2.  This  arrangement 
ensures that each bit of frame n is immediately succeeded in the transmission
stream  by  the  corresponding  bit  of  the  contiguous  frame  (n  +  1).  Implicit 
in  this  strategy  is  that  each  voicing  bit  is  transmitted  twice  and  that 
synchronization  is  based  on  a  frame  pair  (96  bits).  In  the  1200  bps 
mode,  there  is  no  interlacing. 

It is necessary now to consider the four possible situations that
could occur: 2400→2400, 2400→1200, 1200→2400, 1200→1200. If the
transmitter and receiver rates are matched, there is no problem. If a 2400
bps  source  is  being  received  by  a  1200  bps  sink,  only  every  other  bit 
will  be  received.  The  interleaved  format  guarantees  that  every  other 
frame will be absorbed in its entirety. Since frame-fill data is present
in every frame, no special synchronization strategy is necessary and the
1200 bps synthesizer can function in its usual way.

[Fig. 1(a-b). Prioritized packets for embedding a channel vocoder:
(a) priority P1 packet (48 bits); (b) priority P0 packet (48 bits).]

[Fig. 2. Bit interleaving within a frame pair.]

If a 1200 bps source is being received by a 2400 bps sink, the sink will
clock in each received bit twice. Since it assumes an interlaced format,
it will de-multiplex two identical frames and synthesize accordingly.
Speech quality will be considerably poorer than it would be with the
frame-fill mechanism operative, but the link will probably be usable.
On the other hand, the receiver could be equipped with frame-fill logic
and some means could be provided for it to determine trivially that it is
connected to a 1200 bps stream. This might be accomplished by noting
that successive bits are always pair-wise identical, or a special bit
in the transmission format could be provided as a rate ID.
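
The mismatched-rate cases can be illustrated with a toy bit-level model. This is a sketch only: frames are short lists of bits rather than 48-bit frames, synchronization is ignored, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Bits = List[int]


def interleave(frame_n: Sequence[int], frame_n1: Sequence[int]) -> Bits:
    """2400 bps mode: each bit of frame n is followed by the same bit of frame n+1."""
    stream: Bits = []
    for a, b in zip(frame_n, frame_n1):
        stream.extend((a, b))
    return stream


def receive_every_other_bit(stream: Sequence[int], phase: int = 0) -> Bits:
    """A 1200 bps sink on a 2400 bps stream sees only every other bit."""
    return list(stream[phase::2])


def clock_each_bit_twice(frame: Sequence[int]) -> Bits:
    """A 2400 bps sink on a 1200 bps stream samples each bit twice."""
    return [bit for bit in frame for _ in range(2)]


def deinterleave(stream: Sequence[int]) -> Tuple[Bits, Bits]:
    """Undo the frame-pair interleaving assumed by a 2400 bps receiver."""
    return list(stream[0::2]), list(stream[1::2])


def looks_like_1200_source(stream: Sequence[int]) -> bool:
    """Rate check suggested in the text: successive bits are pair-wise identical."""
    return all(stream[i] == stream[i + 1] for i in range(0, len(stream) - 1, 2))
```

Whichever phase a half-rate receiver locks to, it recovers one complete frame of the pair, which is why every frame carries its own frame-fill data in this format.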

The  principles  described  above  could  be  applied  equally  well  to  an 
LPC-10 type of vocoder. However, the transmission format presently
specified  in  the  DoD  narrowband  interoperability  standard  is  not  appropriate 
for  supporting  this  kind  of  flexibility  although  there  are  no  conceptual 
problems  with  the  vocoder  algorithm  per  se.  A  modified  format  would  be 
necessary  if  these  compatibility  features  were  deemed  essential. 


VII.  SUMMARY  AND  CONCLUSIONS 


Methods  have  been  described  based  on  the  principle  of  frame  fill- 
in  for  developing  reduced  rate  transmission  systems  from  standard  2400 
bps  backbones.  It  was  shown  that  both  channel  vocoder  and  LPC  types  of 
vocoders  could  be  adapted  with  virtually  no  increase  in  computational 
complexity  to  operate  at  1200  or  800  bps.  The  compatibility  of  this 
approach  with  embedded  coding  concepts  was  discussed. 

It  was  found  through  informal  and  formal  evaluation  methods  that 
both  channel  vocoder  and  LPC-based  systems  perform  quite  well  at  1200 
bps  and  would  probably  be  usable  in  most  environments  where  the  2400  bps 
parent  could  be  successfully  operated.  At  800  bps  both  systems  were 
considered  marginal  and  usable  only  in  limited  circumstances.  However, 
the  channel  vocoder  was  seen  to  perform  incrementally  better,  probably 
due  to  the  uniquely  efficient  parameter  coding  scheme  employed  which 
tends  to  be  less  sensitive  to  quantization  inaccuracies. 